Spam filters: Bayes vs. chi-squared; letters vs. words
Abstract
We compare two statistical methods for identifying spam, or junk electronic mail. Spam filters are classifiers that determine whether an email is junk. The proliferation of spam email has made electronic filtering vitally important, and we discuss the magnitude of the problem. We examine the Naive Bayesian method in relation to the ‘Chi by degrees of Freedom’ approach, which is used in the field of authorship identification. Both methods produce very promising results. However, ‘Chi by degrees of Freedom’ has the advantage of providing significance measures, which will help to reduce false positives. Statistics based on character-level tokenization prove more effective than those based on word-level tokenization.
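As a rough illustration of the comparison the abstract describes, the sketch below scores a message against spam and legitimate reference text using a chi-squared statistic divided by its degrees of freedom, with either character-level or word-level tokens. It is only a minimal sketch under stated assumptions: the abstract does not give the exact formulation, so the 2 x k contingency-table statistic, the lowest-score decision rule, and the names `char_tokens`, `word_tokens`, `chi_by_df`, and `classify` are all hypothetical choices made here, not the authors' implementation.

```python
from collections import Counter


def char_tokens(text, n=1):
    """Character-level tokens: overlapping n-grams (n=1 gives single letters)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(text):
    """Word-level tokens: lowercase, split on whitespace."""
    return text.lower().split()


def chi_by_df(sample_a, sample_b):
    """Chi-squared over a 2 x k table of token counts, divided by df = k - 1.

    Assumption: this is one plausible reading of 'chi by degrees of freedom';
    lower values mean the two samples have more similar token distributions.
    """
    a, b = Counter(sample_a), Counter(sample_b)
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one token")
    grand = total_a + total_b
    vocab = set(a) | set(b)
    chi2 = 0.0
    for tok in vocab:
        col = a[tok] + b[tok]
        exp_a = col * total_a / grand  # expected count in sample A
        exp_b = col * total_b / grand  # expected count in sample B
        chi2 += (a[tok] - exp_a) ** 2 / exp_a + (b[tok] - exp_b) ** 2 / exp_b
    df = max(len(vocab) - 1, 1)
    return chi2 / df


def classify(message, spam_text, ham_text, tokenize=char_tokens):
    """Label the message by whichever reference text it resembles more."""
    msg = tokenize(message)
    spam_score = chi_by_df(msg, tokenize(spam_text))
    ham_score = chi_by_df(msg, tokenize(ham_text))
    return "spam" if spam_score < ham_score else "ham"
```

Passing `tokenize=word_tokens` or `tokenize=char_tokens` switches between the two tokenization levels the abstract contrasts. Because the underlying value is a chi-squared statistic with known degrees of freedom, it could in principle be compared against a critical value to obtain a significance measure; that step is omitted here.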
Similar resources
Searching for Interacting Features for Spam Filtering
In this paper, we propose a novel feature selection method, INTERACT, to select relevant words of emails for spam email filtering, i.e. classifying an email as spam or legitimate. Four traditional feature selection methods in the text categorization domain, Information Gain, Gain Ratio, Chi Squared, and ReliefF, are also used for performance comparison. Three classifiers, Support Vector Machine (SVM...
Good Word Attacks on Statistical Spam Filters
Unsolicited commercial email is a significant problem for users and providers of email services. While statistical spam filters have proven useful, senders of spam are learning to bypass these filters by systematically modifying their email messages. In a good word attack, one of the most common techniques, a spammer modifies a spam message by inserting or appending words indicative of legitima...
Exploration of Neuro-Fuzzy Spam Filtering based on Naive Bayes Filters
A text parser was used to calculate the statistical distribution of words within an email body. This information was used by a neuro-fuzzy system to determine the spam classification of the email. This process of detecting spam in an email was experimentally found to be 90% efficient. This design is exceptionally good as compared to present-day filters based on its simplicity and limited scope o...
Naive Bayes Spam Filtering Using Word-Position-Based Attributes
This paper explores the use of the naive Bayes classifier as the basis for personalised spam filters. Several machine learning algorithms, including variants of naive Bayes, have previously been used for this purpose, but the author’s implementation using word-position-based attribute vectors gave very good results when tested on several publicly available corpora. The effects of various forms o...
Naive Bayes Spam Filtering Using Word Position Attributes
This paper explores the use of the naive Bayes classifier as the basis for personalized spam filters. Various machine learning algorithms, including variants of naive Bayes, have previously been used for this purpose, but the author’s implementation using word-position-based attribute vectors gives very good results when tested on several publicly available corpora. The effect of various forms ...